First, we import all the needed libraries.
In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import json
import math
import time
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
We noticed that the actual data from topuniversities is not directly on the webpage, but in a separate text file in JSON format. Thus, we first fetch this JSON, parse it, and take its first 200 entries. We also noticed that the university with rank 199 is actually the 198th entry, so the last 3 universities need to have their ranks corrected.
In [2]:
r = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508259845358')
raw_data = json.loads(r.text)['data'][:200]
We can print the first entry of the data to see how the information is represented.
In [3]:
raw_data[0]
Out[3]:
We can now define functions that will help us during the processing of this JSON.
First, process_university
takes as input the raw JSON of a particular university, and outputs a dictionary containing the name, rank, country, region, number of faculty members (international and total) and number of students (international and total) for that university.
It uses other functions defined below.
In [4]:
def process_university(uni):
    name = uni['title']
    rank = get_rank(uni['rank_display'])
    country = uni['country']
    region = uni['region']
    numbers = get_numbers(uni['url'])
    info = {'name': name, 'rank': rank, 'country': country, 'region': region}
    info.update(numbers)
    return info
As there can be ties in rank, the displayed rank is not always an integer. Furthermore, as said above, the last 3 universities have incorrect ranks that need to be fixed.
In [5]:
def get_rank(rank_display):
    rank = int(rank_display.replace("=", ""))
    if rank >= 199:
        rank -= 1
    return rank
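As a standalone check of this correction (the function is reproduced here so the snippet runs on its own), a tied rank such as "=12" parses to a plain integer, while the off-by-one tail is shifted down:

```python
def get_rank(rank_display):
    # "=12" marks a tie in the displayed rank; strip the "=" before converting
    rank = int(rank_display.replace("=", ""))
    # displayed ranks of 199 and above are off by one in the source data
    if rank >= 199:
        rank -= 1
    return rank

print(get_rank("=12"))   # 12 (tie marker stripped)
print(get_rank("200"))   # 199 (tail shifted down by one)
```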
To get the number of faculty members (international and total) and the number of students (international and total), we need to make another request, and this time we will need to parse the webpage using BeautifulSoup.
By inspecting the webpage, we identified the classes of the elements that contain the numbers. Once we have these elements, we further parse their contents to get the values as integers.
During the parsing, we noticed that one university (NYU) does not use the same template as the others, so its number of students is unknown.
In [6]:
def get_numbers(url):
    r = requests.get("https://www.topuniversities.com/" + url)
    soup = BeautifulSoup(r.text, 'html.parser')
    faculty_info = soup.select(".text .number")
    if len(faculty_info) >= 2:
        total_faculty = parse_int(faculty_info[0].decode_contents(formatter="html"))
        international_faculty = parse_int(faculty_info[1].decode_contents(formatter="html"))
    else:
        total_faculty = math.nan
        international_faculty = math.nan
    student_info = soup.select(".barp .number")
    if len(student_info) >= 2:
        total_student = parse_int(student_info[0].decode_contents(formatter="html"))
        international_student = parse_int(student_info[1].decode_contents(formatter="html"))
    else:
        total_student = math.nan
        international_student = math.nan
    return {'total_faculty': total_faculty, 'international_faculty': international_faculty,
            'total_student': total_student, 'international_student': international_student}
In [7]:
def parse_int(s):
    return int(s.replace("\n", "").replace(" ", "").replace(",", ""))
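The extraction pipeline can be illustrated on a minimal, hypothetical HTML snippet that mimics the structure we rely on (the real page is more complex, but the CSS classes are the ones named above; `parse_int` is repeated so the snippet runs on its own):

```python
from bs4 import BeautifulSoup

def parse_int(s):
    # strip newlines, spaces and thousands separators before converting
    return int(s.replace("\n", "").replace(" ", "").replace(",", ""))

# Hypothetical minimal snippet: a ".number" element nested inside
# a ".text" container, as on the real university pages.
html = '<div class="text"><span class="number">4,513\n</span></div>'
soup = BeautifulSoup(html, 'html.parser')
number = soup.select(".text .number")[0]
value = parse_int(number.decode_contents(formatter="html"))
print(value)  # 4513
```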
We put the gathered and parsed data of the universities in a new JSON file for later reuse so that we don't have to generate new requests every time we run the notebook. Then we create a dataframe from this data and display it.
In [8]:
# Uncomment and run this if you want to regenerate the JSON
"""unis1 = []
for uni in raw_data:
unis1.append(process_university(uni))
with open('data1.json', 'w') as f:
json.dump(unis1, f)"""
Out[8]:
In [9]:
with open('data1.json', 'r') as f:
    unis1 = json.load(f)
In [10]:
df = pd.DataFrame(unis1)
df
Out[10]:
In [11]:
df['staff_student_ratio'] = df['total_faculty'] / df['total_student']
df.sort_values(['staff_student_ratio'], ascending=[False])[['name', 'rank', 'staff_student_ratio']]
Out[11]:
For clarity, we only show the universities with the 10 highest ratios.
In [12]:
df.sort_values('staff_student_ratio', ascending=False)[['name', 'staff_student_ratio']].head(10).plot(
    title='Rank according to the ratio between faculty members and students',
    figsize=(13, 6), kind='bar', x='name')
Out[12]:
We note that most of the universities with a high ratio are also highly ranked.
In [13]:
df['international_student_ratio'] = df['international_student'] / df['total_student']
df.sort_values(['international_student_ratio'], ascending=[False])[['name', 'rank', 'international_student_ratio']]
Out[13]:
For clarity, we only show the universities with the 10 highest ratios.
In [14]:
df.sort_values('international_student_ratio', ascending=False)[['name', 'international_student_ratio']].head(10).plot(title='Rank according to the ratio of international students',figsize=(13,6), kind='bar', x = 'name')
Out[14]:
We can note that, again, many highly ranked universities also have a high ratio, but here the effect is less apparent than for the previous ratio.
In [15]:
df_staff_country = df.groupby('country').mean().sort_values('staff_student_ratio', ascending=False)[['staff_student_ratio']]
df_staff_country
Out[15]:
In [16]:
df_staff_country['staff_student_ratio'].plot(title='Rank according to the ratio between faculty members and students, grouped by country', figsize=(16,6),kind='bar')
Out[16]:
Here we can note some surprising results. Many countries with high ratios do not seem to have many highly ranked universities. In fact, we believe that for these countries, only a few (or even a single) university appears in the ranking; given that such a university would be the best in its country, it is understandable that its ratio is high.
In [17]:
df_staff_region = df.groupby('region').mean().sort_values('staff_student_ratio', ascending=False)[['staff_student_ratio']]
df_staff_region
Out[17]:
In [18]:
df_staff_region['staff_student_ratio'].plot(title = 'Rank according to the ratio between faculty members and students, grouped by region',figsize=(13,6),kind='bar')
Out[18]:
Here we note that the highest ratios tend to be in the regions with the most highly ranked universities.
In [19]:
df_int_country = df.groupby('country').mean().sort_values('international_student_ratio', ascending=False)[['international_student_ratio']]
df_int_country
Out[19]:
In [20]:
df_int_country['international_student_ratio'].plot(title='Rank according to the ratio of international students, grouped by country', figsize=(16,6),kind='bar')
Out[20]:
We note that the countries with the highest ratios of international students also seem to be the most attractive in terms of location and/or wealth.
In [21]:
df_int_region = df.groupby('region').mean().sort_values('international_student_ratio', ascending=False)[['international_student_ratio']]
df_int_region
Out[21]:
In [22]:
df_int_region['international_student_ratio'].plot(title='Rank according to the ratio of international students, grouped by region', figsize=(13,6),kind='bar')
Out[22]:
The results by region seem to confirm the results by country. Again, attractive locations and rich countries seem to attract the most international students. Latin America and Asia, however, are exceptions.
For the next part, where we look at the rankings of universities according to Times Higher Education, we need to define a mapping from countries to regions, since the second ranking does not contain region data and we would like to have it in our tables.
In [23]:
country_region = dict(df[['country', 'region']].groupby(['country', 'region']).groups.keys())
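This one-liner is a bit dense; on a toy frame (hypothetical data), the same expression builds the country-to-region lookup like this: grouping by both columns yields the unique (country, region) pairs as group keys, and `dict()` turns those pairs into a mapping, assuming each country belongs to a single region.

```python
import pandas as pd

# Toy frame with the same two columns as our dataset (hypothetical values).
toy = pd.DataFrame({'country': ['Switzerland', 'Japan', 'Switzerland'],
                    'region': ['Europe', 'Asia', 'Europe']})
# Group keys are (country, region) tuples; dict() maps country -> region.
mapping = dict(toy[['country', 'region']].groupby(['country', 'region']).groups.keys())
print(mapping)  # {'Japan': 'Asia', 'Switzerland': 'Europe'}
```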
Similarly to the previous part, we notice that all the relevant data can be obtained from a simple text file in JSON format. Moreover, the ratios we are interested in are already present in this file and need not be calculated.
In [24]:
r2 = requests.get('https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json')
raw_data2 = json.loads(r2.text)['data'][:200]
When loading the university data from the webpage, we have to adapt it so that it matches the format of the other ranking. One problem is that Times Higher Education refers to Russia as the Russian Federation, so we change it back to Russia. Moreover, the country-to-region dictionary we created has no region value for Luxembourg, since it does not appear in the first ranking, so we manually set its region to Europe.
In [25]:
unis2 = []
for uni in raw_data2:
    name = uni['name']
    rank = uni['rank'].replace('=', '')
    country = uni['location']
    if country == 'Russian Federation':
        country = 'Russia'
    int_students = uni['stats_pc_intl_students'].replace('%', '')
    staff_student = uni['stats_student_staff_ratio']
    info = {'name': name, 'rank': rank, 'country': country,
            'region': country_region.get(country, 'Europe'),
            'international_student_ratio': int(int_students) / 100.0,
            'staff_student_ratio': 1 / float(staff_student)}
    unis2.append(info)
As before, we put our data in a new JSON file to avoid reloading the file each time we run the notebook and generating new requests.
In [26]:
# Uncomment and run this if you want to regenerate the JSON
"""with open('data2.json', 'w') as f:
json.dump(unis2, f)"""
Out[26]:
In [27]:
with open('data2.json', 'r') as f:
    all_unis2 = json.load(f)
df2 = pd.DataFrame(all_unis2)
df2
Out[27]:
In [28]:
df2[['name', 'staff_student_ratio']].sort_values('staff_student_ratio', ascending=False)
Out[28]:
For clarity, we only show the universities with the 10 highest ratios.
In [29]:
df2.sort_values('staff_student_ratio', ascending=False)[['name', 'staff_student_ratio']].head(10).plot(title='Rank according to the ratio between faculty members and students', figsize=(13,6),kind='bar', x = 'name')
Out[29]:
For this ranking, we see quite a difference from the previous one: universities with a high ratio are not necessarily the highest-ranked universities.
In [30]:
df2.sort_values('international_student_ratio', ascending=False)[['name', 'international_student_ratio']]
Out[30]:
For clarity, we only show the universities with the 10 highest ratios.
In [31]:
df2.sort_values('international_student_ratio', ascending=False)[['name', 'international_student_ratio']].head(10).plot(title='Rank according to the ratio of international students', figsize=(13,6),kind='bar', x = 'name')
Out[31]:
The results for this ranking seem very similar to the other ranking, with the exception of additional universities that were not present there.
In [32]:
df_staff_country2 = df2.groupby('country').mean().sort_values('staff_student_ratio', ascending=False)[['staff_student_ratio']]
df_staff_country2
Out[32]:
In [33]:
df_staff_country2['staff_student_ratio'].plot(title='Rank according to the ratio between faculty members and students, grouped by country', figsize=(16,6),kind='bar')
Out[33]:
We note similar results as in the first ranking.
In [34]:
df_staff_region2 = df2.groupby('region').mean().sort_values('staff_student_ratio', ascending=False)[['staff_student_ratio']]
df_staff_region2
Out[34]:
In [35]:
df_staff_region2['staff_student_ratio'].plot(title='Rank according to the ratio between faculty members and students, grouped by region', figsize=(13,6),kind='bar')
Out[35]:
We note a few differences in this result compared to the first ranking. First, Africa is now in second place: all regions seem to have a lower average ratio than in the previous ranking, but Africa's average decreased the least. Moreover, we note the disappearance of Latin America.
In [36]:
df_int_country2 = df2.groupby('country').mean().sort_values('international_student_ratio', ascending=False)[['international_student_ratio']]
df_int_country2
Out[36]:
In [37]:
df_int_country2['international_student_ratio'].plot(title='Rank according to the ratio of international students, grouped by country', figsize=(16,6),kind='bar')
Out[37]:
The results here are very similar to the previous ranking, with the exception of the addition of Luxembourg, which was not present there.
In [38]:
df_int_region2 = df2.groupby('region').mean().sort_values('international_student_ratio', ascending=False)[['international_student_ratio']]
df_int_region2
Out[38]:
In [39]:
df_int_region2['international_student_ratio'].plot(title='Rank according to the ratio of international students, grouped by region', figsize=(13,6),kind='bar')
Out[39]:
The results here are again very similar to the first ranking.
For the third question, the goal is to merge both dataframes. Since the university names sometimes differ between the two websites, we need to map them to a common value to perform the merge correctly.
To solve this problem, we used Google search: we search for each university name in both dataframes, take the first link returned by Google (usually the university's website), and use this URL to perform the join.
As Google blocks any device that performs too many searches, we delay each search by 5 seconds and store the mapping in a JSON file to avoid searching again each time the notebook is run.
With this technique, we have a very high success rate: for only 3 universities did the two searches not return the same link. These universities were fixed manually, so if you save the automatically generated JSON again below, you will not get the same mapping as in the file present in the repository.
In [40]:
mapping = {}
with open('mapping.json', 'r') as f:
    mapping = json.load(f)
In [41]:
def get_url(name):
    r = requests.get('https://encrypted.google.com/search?q=' + name.replace(' ', '+'))  # request the Google results page
    soup = BeautifulSoup(r.text, 'html.parser')
    google_url = soup.select('.g a')[0]['href']  # the first Google result
    url = google_url[google_url.find("://") + 3:google_url.find("&")]  # extract the URL of the first result
    if url.endswith("/"):
        url = url[:-1]
    time.sleep(5)  # wait to avoid getting blocked by Google
    return url

def get_identifier(name):
    if name not in mapping:  # if the name is already in the mapping, no need to run the search again
        mapping[name] = get_url(name)
    return mapping[name]
Now that we have the mapping, we can apply it to both datasets and then merge them on the 'url' column.
In [42]:
df['url'] = df['name'].apply(get_identifier)
df2['url'] = df2['name'].apply(get_identifier)
raw_merge = df.merge(df2[['url', 'rank', 'international_student_ratio', 'staff_student_ratio']], how='inner', on='url')
raw_merge.columns = ['Country',
                     'International faculty',
                     'International students',
                     'Name',
                     'Rank topuniversities',
                     'Region',
                     'Total faculty',
                     'Total students',
                     'Faculty / students ratio topuniversities',
                     'International students ratio topuniversities',
                     'url',
                     'Rank timeshighereducation',
                     'International students ratio timeshighereducation',
                     'Faculty / students ratio timeshighereducation']
raw_merge
Out[42]:
In [43]:
# Uncomment and run this if you want to save the JSON
"""with open('mapping.json', 'w') as f:
json.dump(mapping, f)"""
Out[43]:
We begin by cleaning and adapting our dataset to better see meaningful correlations. For example, we reverse the order of each university's ranking, since we want higher-ranked universities to have a higher value so that the correlations are meaningful. Moreover, we drop rows that are missing the data we need. We also add another ratio that we thought would be useful for the last part of the homework: the international faculty ratio of each university.
In [44]:
raw_merge_cleaned = raw_merge.copy()
raw_merge_cleaned['rank_time'] = 200 - raw_merge_cleaned['Rank timeshighereducation'].astype(int)
raw_merge_cleaned['rank_top'] = 200 - raw_merge_cleaned['Rank topuniversities'].astype(int)
raw_merge_cleaned = raw_merge_cleaned[pd.notnull(raw_merge_cleaned['International students ratio topuniversities'])]
raw_merge_cleaned['International faculty ratio'] = raw_merge_cleaned['International faculty'] / raw_merge_cleaned['Total faculty']
In [45]:
raw_merge_cleaned.corr()
Out[45]:
We notice an obvious and natural correlation between the columns corresponding to the numbers of students and faculty members: naturally, when there are more students, there are more international students and more faculty members as well. These correlations, however, are not interesting.
Another uninteresting observation is the correlation between the ratios in the two rankings.
The more interesting observations come from the ratios rather than from the raw numbers.
In [46]:
raw_merge_cleaned[['Faculty / students ratio topuniversities',
'International students ratio topuniversities',
'International students ratio timeshighereducation',
'Faculty / students ratio timeshighereducation',
'International faculty ratio',
'rank_time',
'rank_top']].corr()
Out[46]:
When looking only at the correlations involving the ratios and ranks, we can note a few interesting things. Note first the additional ratio, the international faculty ratio (this information comes from the Top Universities ranking only).
First, we can note that the faculty/students ratio from Top Universities has a relatively high correlation with both ranks. In our opinion, there are two possible explanations: higher-ranked universities may have more funds to hire more faculty staff, or universities with a higher rank may want to keep or improve their position by hiring more faculty members.
Another relatively important correlation we see is the correlation between international student ratio and both ranks. This can surely be explained by the attractiveness of high ranked universities to international students.
The highest correlation we observe is the one between international faculty ratio and the international students ratio. Universities that attract international students will also attract international faculty members for probably the same reason.
In [47]:
raw_merge_cleaned[['Total students',
'Total faculty',
'International faculty',
'International students',
'rank_time',
'rank_top']].corr()
Out[47]:
We can still take a quick look at the total numbers and their correlations. In fact, we see that there is almost no correlation between the total number of students and the rank. Moreover, we also see that the higher the rank, the more faculty members a university has. This confirms our previous intuition from the ratios.
For the last question, we take the following approach. We create new scores for the universities based on the different ratio columns for which we have correlations. We then compute a weighted average of these scores for each university, which defines the new ranking. The top-scoring university of this ranking will be our best university.
The weights are based on the correlations we found earlier and our explanations for them. We apply the following weights to each column:
In [48]:
raw_merge_cleaned['score'] = (
    0.1 * raw_merge_cleaned['Faculty / students ratio topuniversities'] +
    0.1 * raw_merge_cleaned['Faculty / students ratio timeshighereducation'] +
    0.025 * raw_merge_cleaned['International students ratio topuniversities'] +
    0.025 * raw_merge_cleaned['International students ratio timeshighereducation'] +
    0.15 * raw_merge_cleaned['International faculty ratio'] +
    0.3 * (raw_merge_cleaned['rank_time'] / 200.0) +
    0.3 * (raw_merge_cleaned['rank_top'] / 200.0))
raw_merge_cleaned[['Name', 'score']].sort_values(by='score', ascending=False).reset_index(drop=True)
Out[48]:
According to our new ranking, the best university is the Massachusetts Institute of Technology.